AITopics | hard learning problem

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Neural Information Processing SystemsNov-20-2025, 21:47:42 GMT

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.

hard learning problem, statistical optimality, stochastic gradient descent, (7 more...)

Neural Information Processing Systems

Industry: Education > Focused Education > Special Education (0.43)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (0.67)

Add feedback

Reviews: Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Neural Information Processing SystemsOct-7-2024, 05:11:12 GMT

This paper identifies and separates (kernel) linear least-squares regression problems wherein carrying out multiple passes of stochastic gradient descent (SGD) over a training set can yield better statistical error than only a single pass. This is relevant to the core of machine learning theory, and relates to a line of work published at NIPS, ICML, COLT, and similar conferences in the past several years about the statistical error of one-pass, many-pass, and ERM-based learning. The authors focus on regression problems captured, by assumption, by two parameters: alpha, which governs the exponent of a power-law eigenvalue decay, and r, which governs a transformation under which the Hilbert norm of the optimal predictor is bounded. They refer to problems where r (alpha - 1) / (2 * alpha) as "hard". The main result of the paper is to show that for these "hard" problems, multiple SGD passes either achieve (minimax) optimal rates of statistical estimation, or at least improve the rate relative to a single pass. The results are interesting and might address an unanswered core question in machine learning, and the mathematical presentation is clear, with assumptions upfront.

hard learning problem, statistical optimality, stochastic gradient descent, (11 more...)

Neural Information Processing Systems

Industry: Education > Focused Education > Special Education (0.47)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Pillaud-Vivien, Loucas, Rudi, Alessandro, Bach, Francis

Neural Information Processing SystemsFeb-14-2020, 20:11:59 GMT

We consider stochastic gradient descent (SGD) for least-squares regression with potentially several passes over the data. While several passes have been widely reported to perform practically better in terms of predictive performance on unseen data, the existing theoretical analysis of SGD suggests that a single pass is statistically optimal. While this is true for low-dimensional easy problems, we show that for hard problems, multiple passes lead to statistically optimal predictions while single pass does not; we also show that in these hard models, the optimal number of passes over the data increases with sample size. In order to define the notion of hardness and show that our predictive performances are optimal, we consider potentially infinite-dimensional models and notions typically associated to kernel methods, namely, the decay of eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix. We illustrate our results on synthetic experiments with non-linear kernel methods and on a classical benchmark with a linear model.

hard learning problem, statistical optimality, stochastic gradient descent, (5 more...)

Neural Information Processing Systems

Genre: Research Report (0.44)

Industry: Education > Focused Education > Special Education (0.40)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning > Gradient Descent (1.00)

Add feedback

Filters

Collaborating Authors

hard learning problem

Information about AI from the News, Publications, and Conferences

Automatic Classification – Tagging and Summarization – Customizable Filtering and Analysis

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Reviews: Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes